Sampled Weighted Min-Hashing for Large-Scale Topic Mining

Authors

  • Gibran Fuentes Pineda
  • Iván V. Meza
Abstract

We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to automatically mine topics from large-scale corpora. SWMH generates multiple random partitions of the corpus vocabulary based on term co-occurrence and agglomerates highly overlapping inter-partition cells to produce the mined topics. While other approaches define a topic as a probability distribution over a vocabulary, SWMH topics are ordered subsets of that vocabulary. Interestingly, the topics mined by SWMH reflect themes from the corpus at different levels of granularity. We extensively evaluate the meaningfulness of the mined topics both qualitatively and quantitatively on the NIPS (1.7K documents), 20 Newsgroups (20K), Reuters (800K) and Wikipedia (4M) corpora. Additionally, we compare the quality of SWMH topics with that of Online LDA topics for document representation in classification.
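
The partition-and-agglomerate idea in the abstract can be illustrated with a minimal sketch. It is not the authors' SWMH implementation: it uses plain (unweighted, unsampled) min-hashing over each term's set of documents, and the function names, parameters (num_tables, tuple_size, threshold) and the greedy overlap-based merging rule are all assumptions made for illustration. Terms whose document sets collide under a tuple of min-hash values fall into the same cell, and cells that overlap strongly are merged into a topic-like term set.

```python
import random
from collections import defaultdict


def minhash_cells(term_docs, num_tables=50, tuple_size=2, seed=0):
    """Group terms whose document sets collide under a tuple of min-hashes.

    term_docs maps each term to the non-empty set of document ids containing it.
    Plain min-hashing only; the weighting and sampling of SWMH are omitted.
    """
    rng = random.Random(seed)
    doc_ids = sorted({d for docs in term_docs.values() for d in docs})
    cells = set()
    for _ in range(num_tables):
        # Each hash function is simulated by a random permutation of documents.
        ranks = []
        for _ in range(tuple_size):
            perm = doc_ids[:]
            rng.shuffle(perm)
            ranks.append({d: i for i, d in enumerate(perm)})
        # Terms with the same tuple of minimum ranks share a cell in this partition.
        buckets = defaultdict(set)
        for term, docs in term_docs.items():
            key = tuple(min(rank[d] for d in docs) for rank in ranks)
            buckets[key].add(term)
        for terms in buckets.values():
            if len(terms) > 1:
                cells.add(frozenset(terms))
    return cells


def merge_overlapping(cells, threshold=0.5):
    """Greedily merge each cell into the first cluster it overlaps enough with.

    Overlap is measured as |A & B| / min(|A|, |B|); this rule is an assumption.
    """
    clusters = []
    for cell in sorted(cells, key=lambda c: (-len(c), sorted(c))):
        for cluster in clusters:
            overlap = len(cell & cluster) / min(len(cell), len(cluster))
            if overlap >= threshold:
                cluster |= cell  # in-place union keeps the list entry updated
                break
        else:
            clusters.append(set(cell))
    return clusters


# Toy inverted index (hypothetical data): term -> ids of documents containing it.
term_docs = {
    "neural": {0, 1, 2},
    "network": {0, 1, 2},
    "stock": {3, 4},
    "market": {3, 4},
}
topics = merge_overlapping(minhash_cells(term_docs))
```

On this toy index, terms with identical document sets always collide and terms with disjoint document sets never do, so the sketch yields two term sets. On real corpora the number of tables and the tuple size would trade recall against precision of the recovered cells.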

Similar resources

Scalable Detection of Emerging Topics and Geo-spatial Events in Large Textual Streams

Key ideas of our solution:

  • From statistics: control charts for change detection.
  • From computational linguistics: analyze word co-occurrences for more meaningful results.
  • From mathematics: exponentially weighted moving averages for streaming operation.
  • From databases: hashing and Count-Min sketches for scalability to large data.
  • From data mining: clustering of word pairs into simple “top...

Tag-Weighted Topic Model For Large-scale Semi-Structured Documents

To date, the evolution of the Internet has produced massive numbers of Semi-Structured Documents (SSDs). These SSDs contain both unstructured features (e.g., plain text) and metadata (e.g., tags). Most previous works focused on modeling the unstructured text, and recently, some other methods have been proposed to model the unstructured text with specific tags. To build a general model for SSDs r...

Min-wise independent sampling from skewed data streams

Min-wise independent hashing is a powerful sampling technique for estimating the similarity between sets. In particular, it has proved to be ubiquitous for mining data streams of large volume where the input sets are revealed in arbitrary order and the elements in a given set do not arrive consecutively. More precisely, for sets of elements E and attributes A the input is a stream of element-at...

Image authentication using LBP-based perceptual image hashing

Feature extraction is a main step in all perceptual image hashing schemes, in which robust features lead to better perceptual robustness. Simplicity, discriminative power, computational efficiency and robustness to illumination changes are counted among the distinguishing properties of Local Binary Pattern features. In this paper, we investigate the use of local binary patterns for percep...

Multiple Feature Hashing Learning for Large-Scale Remote Sensing Image Retrieval

Driven by the urgent demand of remote sensing big data management and knowledge discovery, large-scale remote sensing image retrieval (LSRSIR) has attracted more and more attention. As is well known, hashing learning has played an important role in coping with big data mining problems. In the literature, several hashing learning methods have been proposed to address LSRSIR. Until now, existing ...

Publication date: 2015